AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.70)

Neural Information Processing SystemsFeb-11-2026, 08:11:35 GMT

CEDe: Acollectionofexpert-curateddatasetswith atom-levelentityannotationsforOpticalChemical StructureRecognition

Optical Chemical Structure Recognition (OCSR) dealswiththetranslation from chemical imagestomolecular structures, thisbeingthemainwaychemical compounds are depicted in scientific documents.

artificial intelligence, dataset, machine learning, (18 more...)

Country: North America > United States > California > Orange County > Irvine (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Neural Information Processing SystemsDec-24-2025, 23:43:25 GMT

CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition

Optical Chemical Structure Recognition (OCSR) deals with the translation from chemical images to molecular structures, this being the main way chemical compounds are depicted in scientific documents. Traditionally, rule-based methods have followed a framework based on the detection of chemical entities, such as atoms and bonds, followed by a compound structure reconstruction step. Recently, neural architectures analog to image captioning have been explored to solve this task, yet they still show to be data inefficient, using millions of examples just to show performances comparable with traditional methods. Looking to motivate and benchmark new approaches based on atomic-level entities detection and graph reconstruction, we present CEDe, a unique collection of chemical entity bounding boxes manually curated by experts for scientific literature datasets. These annotations combine to more than 700,000 chemical entity bounding boxes with the necessary information for structure reconstruction. Also, a large synthetic dataset containing one million molecular images and annotations is released in order to explore transfer-learning techniques that could help these architectures perform better under low-data regimes. Benchmarks show that detection-reconstruction based models can achieve performances on par with or better than image captioning-like models, even with 100x fewer training examples.

artificial intelligence, machine learning, proceedings, (9 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.59)

Neural Information Processing SystemsAug-17-2025, 16:05:14 GMT

ada36dfeb684a5c11f783fc170c294fe-Supplemental-Datasets_and_Benchmarks.pdf

artificial intelligence, dataset, machine learning, (18 more...)

Industry: Law (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsAug-17-2025, 16:05:10 GMT

CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition

Chemical information in scientific literature is commonly presented in various ways, such as text, tables, charts, and images.

machine learning, natural language, pattern recognition, (18 more...)

Country:

North America > United States > California > Orange County > Irvine (0.04)
Europe > United Kingdom > England > West Midlands > Birmingham (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Industry:

Law > Intellectual Property & Technology Law (0.69)
Health & Medicine > Pharmaceuticals & Biotechnology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.93)
(2 more...)

Khodadad, Mohammad, Kasmaee, Ali Shiraee, Astaraki, Mahdi, Sherck, Nicholas, Mahyar, Hamidreza, Samiee, Soheila

Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study

arXiv.org Artificial IntelligenceAug-7-2025

In this study, we introduced a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. We designed and validated a fully automated pipeline, verified by subject matter experts, to facilitate this task. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a comprehensive knowledge graph. By generating multi-hop questions across these graphs, we assess LLM performance in both context-augmented and non-context augmented settings. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning. The results reflect the importance of augmenting LLMs with document retrieval, which can have a substantial impact on improving their performance. However, even perfect retrieval accuracy with full context does not eliminate reasoning errors, underscoring the complexity of compositional reasoning. This work not only benchmarks and highlights the limitations of current LLMs but also presents a novel data generation pipeline capable of producing challenging reasoning datasets across various domains. Overall, this research advances our understanding of reasoning in computational linguistics.

large language model, machine learning, natural language, (15 more...)

2504.16414

Country: North America (0.28)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)

Industry:

Materials > Chemicals > Commodity Chemicals > Petrochemicals (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Artificial IntelligenceFeb-26-2025

Chemical knowledge-informed framework for privacy-aware retrosynthesis learning

Chen, Guikun, Zhang, Xu, Yang, Yi, Wang, Wenguan

Chemical reaction data is a pivotal asset, driving advances in competitive fields such as pharmaceuticals, materials science, and industrial chemistry. Its proprietary nature renders it sensitive, as it often includes confidential insights and competitive advantages organizations strive to protect. However, in contrast to this need for confidentiality, the current standard training paradigm for machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries and frequent data transmission between entities, potentially exposing proprietary information to unauthorized access or interception during storage and transfer. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models. CKIF enables distributed training across multiple chemical organizations without compromising the confidentiality of proprietary reaction data. Instead of gathering raw reaction data, CKIF learns retrosynthesis models through iterative, chemical knowledge-informed aggregation of model parameters. In particular, the chemical properties of predicted reactants are leveraged to quantitatively assess the observable behaviors of individual models, which in turn determines the adaptive weights used for model aggregation. On a variety of reaction datasets, CKIF outperforms several strong baselines by a clear margin (e.g., ~20% performance improvement over FedAvg on USPTO-50K), showing its feasibility and superiority to stimulate further research on privacy-preserving retrosynthesis.

accuracy, ckif, reaction data, (14 more...)

2502.19119

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Energy (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Neural Information Processing SystemsJan-18-2025, 13:08:29 GMT

CEDe: A collection of expert-curated datasets with atom-level entity annotations for Optical Chemical Structure Recognition

atom-level entity annotation, expert-curated dataset, optical chemical structure recognition, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.62)

arXiv.org Artificial IntelligenceOct-19-2024

MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

Kim, Junho, Kim, Yeachan, Park, Jun-Hyung, Oh, Yerim, Kim, Suho, Lee, SangKeun

We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt the pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that solely focus on constructing domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that materials science corpus has distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.

machine learning, materials science, natural language, (20 more...)

2410.15126

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Langer, Stefan, Neuhaus, Fabian, Nürnberger, Andreas

CEAR: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature

arXiv.org Artificial IntelligenceJul-31-2024

Ontologies are formal representations of knowledge in specific domains that provide a structured framework for organizing and understanding complex information. Creating ontologies, however, is a complex and time-consuming endeavor. ChEBI is a well-known ontology in the field of chemistry, which provides a comprehensive resource for defining chemical entities and their properties. However, it covers only a small fraction of the rapidly growing knowledge in chemistry and does not provide references to the scientific literature. To address this, we propose a methodology that involves augmenting existing annotated text corpora with knowledge from Chebi and fine-tuning a large language model (LLM) to recognize chemical entities and their roles in scientific text. Our experiments demonstrate the effectiveness of our approach. By combining ontological knowledge and the language understanding capabilities of LLMs, we achieve high precision and recall rates in identifying both the chemical entities and roles in scientific literature. Furthermore, we extract them from a set of 8,000 ChemRxiv articles, and apply a second LLM to create a knowledge graph (KG) of chemical entities and roles (CEAR), which provides complementary information to ChEBI, and can help to extend it.

chemical entity, chemical entity and role, relation, (13 more...)

2407.21708

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
Europe > Germany > Saxony-Anhalt > Magdeburg (0.05)
North America > United States > Colorado (0.04)
Europe > Netherlands (0.04)

Genre: Research Report (1.00)

Industry:

Materials > Chemicals > Commodity Chemicals > Petrochemicals (0.95)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.35)